Abstract. Distributed systems are now both very large and highly dynamic. Peer
to peer overlay networks have proved efficient at coping with this new situation,
which traditional approaches can no longer accommodate. While the challenge of
organizing peers in an overlay network has generated a lot of interest, leading to a
large number of solutions, maintaining critical data in such a network remains an
open issue. In this paper, we are interested in defining the portion of nodes and
the frequency one has to probe, given the churn observed in the system, in order to
achieve a given probability of maintaining the persistence of some critical data.
More specifically, we provide a clear result relating the size and the frequency of
the probing set, along with its proof, as well as an analysis of how to leverage
such information in a large scale dynamic distributed system.

Keywords: Churn, Core, Dynamic system, Peer to peer system, Persistence,
Probabilistic guarantee, Quality of service, Survivability.



1 Introduction 

Context of the paper. Persistence of critical data in distributed applications is a crucial
problem. Although many solutions exist for static systems, mostly relying on
defining the right degree of replication, this remains an open issue in the context of
dynamic systems.

Recently, peer to peer (P2P) systems became popular as they have proved
efficient at coping with the scale shift observed in distributed systems. A P2P system is a
dynamic system that allows peers (nodes) to join or leave the system. In the meantime,
a natural tendency has emerged to trade strong deterministic guarantees for probabilistic
ones in order to cope with both scale and dynamism. Yet, quantifying the bounds of the
guarantees that can be achieved probabilistically is very important for the deployment
of applications.

More specifically, a typical issue is to ensure that despite dynamism some critical 
data is not lost. The set of nodes owning a copy of the critical data is called a core 
(distinct cores can possibly co-exist, each associated with a particular data). 

Provided that core nodes remain long enough in the system, a "data/state transfer"
protocol can transmit the critical data from nodes to nodes. This ensures that a new
core of nodes in the system will keep track of the data. Hence, such protocols provide
data persistence despite the uncertainty of the system state induced by the dynamic
evolution of its members.

There is however an inherent tradeoff in the use of such protocols. If the policy 
that is used is too conservative, the data transfer protocol might be executed too often, 
thereby consuming resources and increasing the whole system overhead. Conversely, if 
the protocol is executed too rarely, all nodes owning a copy of the data may leave (or 
crash) before a new protocol execution, and the data would be lost. This fundamental 
tradeoff is the main problem addressed in this paper. 

Content of the paper. Considering the previous context, we are interested in providing 
some probabilistic guarantees of maintaining a core in the system. More precisely, given 
the churn observed in the system, we aim at maintaining the persistence of some critical 
data. To this end, we are interested in defining the portion of nodes that must be probed,
as well as the frequency at which this probe must occur to achieve this result with a
given probability. This boils down to relating the size and the frequency of the probing
set according to a target probability of success and the churn observed in the system. 

The investigation of the previous tradeoff relies on critical parameters. One of them 
is naturally the size of the core. Two other parameters are the percentage of nodes that 
enter/leave the system per time unit, and the duration during which we observe the 
system. We first assume that, per time unit, the number of entering nodes is the same as 
the number of leaving nodes. In other words, the number of nodes remains constant. 

Let $S$ be the system at some time $\tau$. It is composed of n nodes including a subset
of q nodes defining a core Q for a given critical data. Let $S'$ be the system at time
$\tau + \delta$. Because of the system evolution, some nodes owning a copy of the critical data
at time $\tau$ might have left the system at time $\tau + \delta$ (those nodes are in $S$ and not in $S'$).
So, an important question is the following: "Given a set Q' of q' nodes of $S'$, what is
the probability that Q and Q' do intersect?" We derive an explicit expression of this
probability as a function of the parameters characterizing the dynamic system. This 
allows us to compute some of them when other ones are fixed. This provides distributed 
applications with the opportunity to set a tradeoff between a probabilistic guarantee of 
achieving a core and the overhead involved computed either as the number of nodes 
probed or the frequency at which the probing set needs to be refreshed. 

Related work. As mentioned above, P2P systems have received a great deal of attention
both in academia and industry for the past five years. More specifically, a lot of
approaches have been proposed to create overlay networks, whether they are structured,
such as Chord [18], CAN [15] or Pastry [16], or unstructured [5,6,9]. Maintenance of such
overlay networks in the presence of high churn has also been studied as one of the major
goals of P2P overlay networks [10]. The parameters impacting connectivity and routing
capabilities in P2P overlay networks are now well understood.

In structured P2P networks, routing tables contain critical information and refresh- 
ment must occur with some frequency depending on the churn observed in the net- 
work 12 to achieve routing capabilities. For instance in Pastry, the size of the leaf set 
(set of nodes whose identities are numerically the closest to current node identity) and 
its maintenance protocol can be tuned to achieve the routing within reasonable delay 
stretch and low overhead. Finally, there have been approaches evaluating the number of
locations to which a data item has to be replicated in the system in order to be successfully
searched by flooding-based or random walk-based algorithms H. These approaches do 
not consider specifically churn in their analysis. In this paper churn is a primary concern. 
The result of this work can be applied to any P2P network, regardless of its structure, 
in order to maintain critical data by refreshment at sufficiently many locations. 

The use of a base core to extend protocols designed for static systems to dynamic 
systems has been investigated in [14]. Persistent cores share some features with quo- 
rums (i.e., mutually intersecting sets). Quorums originated a long time ago with major- 
ity voting systems |7T9l introduced to ensure data consistency. More recently, quorum 
reconfiguration [ 1 L3J has been proposed to face system dynamism while guarantee-
ing atomic consistency: this application outlines the strength of such dynamic quorums.
Quorum-based protocols for searching objects in P2P systems are proposed in [13].
Probabilistic quorum systems have been introduced in lil2J . They use randomization to 
relax the strict intersection property to a probabilistic one. They have been extended to 
dynamic systems in HI . 

Roadmap. The paper is organized as follows. Section 2 defines the system model.
Section 3 describes our dynamic system analysis and our probabilistic results. Section 4
interprets the previous formulas and shows how to use them to control the uncertainty
of the key parameters of P2P applications. Finally, Section 5 concludes the paper.

2 System model 

The system model, sketched in the introduction, is simple. The system consists of n
nodes, n being the size of the system. It is dynamic in the following sense.
Every time unit, cn nodes leave the system and cn nodes enter the system,
where c is the percentage of nodes that enter/leave the system per time unit; this can be
seen as new nodes "replacing" leaving nodes. Although monitoring the leave and join
rates of a large-scale dynamic system remains an open issue, it is reasonable to assume
that join and leave are tightly correlated in P2P systems. A more realistic model would take
into account variations of the system size depending, for instance, on night-time and
day-time, as observed in [17].

A node leaves the system either voluntarily or because it crashes. A node that leaves
the system does not enter it later. (Practically, this means that, to re-enter the system,
a node that has left must be considered as a new node; all its previous knowledge of
the system state is lost.) For instance, initially (at time $\tau$), assume there are n nodes
(identified from 1 to n; let us take n = 5 to simplify). Let c = 0.2, which means that,
every time unit, nc = 1 node changes (a node disappears and a new node replaces it).
That is, at time $\tau + 1$, one node leaves the system and another one joins. From now
on, observe that the next leaving nodes are either nodes that were initially in the system or
nodes that joined after time $\tau$.
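
The replacement model described in this section can be sketched as a small simulation. This is only an illustration under stated assumptions: the leaving nodes are drawn uniformly at random at each time unit, `round(n * c)` stands for the cn nodes replaced per time unit, and the name `simulate_churn` and its parameters are hypothetical.

```python
import random


def simulate_churn(n: int, c: float, delta: int, seed: int = 0) -> int:
    """Return how many initial nodes are still present after `delta` time units."""
    rng = random.Random(seed)
    nodes = list(range(n))        # identities 0..n-1 are the initial nodes
    next_id = n                   # fresh identity for every joining node
    per_step = round(n * c)       # cn nodes leave (and cn join) per time unit
    for _ in range(delta):
        # Assumption: the leaving nodes are chosen uniformly at random.
        for slot in rng.sample(range(n), per_step):
            nodes[slot] = next_id  # a new node replaces the leaving one
            next_id += 1
    return sum(1 for node in nodes if node < n)


if __name__ == "__main__":
    # n = 5 and c = 0.2: exactly one node is replaced per time unit.
    print(simulate_churn(5, 0.2, 1))   # 4 initial nodes remain
```

A node that re-enters is modeled as a brand new identity, which matches the rule that a node that has left never comes back under its old identity.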

3 Relating the key parameters of the dynamic system 

This section answers the question posed in the introduction, namely, given a set $Q(\tau)$
of nodes at time $\tau$ (the core), and a set $Q(\tau')$ of nodes at time $\tau' = \tau + \delta$, what is the
probability of the event "$Q(\tau) \cap Q(\tau') \neq \emptyset$". In the remainder of this paper, we assume
that both $Q(\tau)$ and $Q(\tau')$ contain q nodes, since an interesting goal is to minimize both
the number of nodes where the data is replicated and the number of nodes one has to
probe to find the data. Let an initial node be a node that belongs to the system at time
$\tau$. Moreover, without loss of generality, let $\tau = 0$ (hence, $\tau' = \delta$).

Lemma 1. Let C be the ratio of initial nodes that are replaced after $\delta$ time units. We
have $C = 1 - (1 - c)^{\delta}$.

Proof. We claim that the number of initial nodes that are still in the system after $\delta$ time
units is $n(1 - c)^{\delta}$. The proof is by induction on the time instants. Let us recall that c is
the percentage of nodes that are replaced in one time unit. For the base case, at time 1,
$n - nc = n(1 - c)$ nodes have not been replaced. For the induction case, let us assume
that at time $\delta - 1$, the number of initial nodes that have not been replaced is $n(1 - c)^{\delta - 1}$.
Let us consider the time instant $\delta$. The number of initial nodes that are not replaced after
$\delta$ time units is $n(1 - c)^{\delta - 1} - n(1 - c)^{\delta - 1} c$, i.e., $n(1 - c)^{\delta}$, which proves the claim.
It follows from the previous claim that the number of initial nodes that are replaced
during $\delta$ time units is $n - n(1 - c)^{\delta}$. Hence, $C = (n - n(1 - c)^{\delta})/n = 1 - (1 - c)^{\delta}$. ■
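
The induction of Lemma 1 can be checked numerically by unrolling the recurrence and comparing it with the closed form. The sketch below is illustrative only; the function names are hypothetical, and the recurrence treats the number of nodes as a real value, as the proof does.

```python
def surviving_initial_nodes(n: int, c: float, delta: int) -> float:
    """Unroll the induction: each time unit removes a fraction c of the survivors."""
    remaining = float(n)
    for _ in range(delta):
        remaining -= remaining * c
    return remaining


def replaced_ratio(c: float, delta: int) -> float:
    """Closed form of Lemma 1: C = 1 - (1 - c)**delta."""
    return 1 - (1 - c) ** delta


if __name__ == "__main__":
    n, c, delta = 1000, 0.01, 50
    print(surviving_initial_nodes(n, c, delta))
    print(n * (1 - replaced_ratio(c, delta)))   # same value up to rounding
```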



Given a core of q nodes at time $\tau$ (each having a copy of the critical data), the
following theorem gives the probability that, at time $\tau' = \tau + \delta$, an arbitrary node
cannot obtain the data when it queries q nodes arbitrarily chosen.

For this purpose, using the result of Lemma 1, we take the number of elements that have
left the system during the period $\delta$ as $\alpha = \lceil Cn \rceil = \lceil (1 - (1 - c)^{\delta}) n \rceil$. This number
allows us to evaluate the aforementioned probability.

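Theorem 1. Let $x_1, \ldots, x_q$ be any q nodes in the system at time $\tau' = \tau + \delta$. The probability
that none of these nodes belongs to the initial core is

$$\epsilon = \sum_{k=a}^{b} \frac{\binom{n+k-q}{q} \binom{q}{k} \binom{n-q}{\alpha-k}}{\binom{n}{q} \binom{n}{\alpha}},$$

where $\alpha = \lceil (1 - (1 - c)^{\delta}) n \rceil$, $a = \max(0, \alpha - n + q)$, and $b = \min(\alpha, q)$.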

Proof. The problem we have to solve can be represented in the following way.

The system is an urn containing n balls (nodes), such that, initially, q balls are green
(they represent the initial core $Q(\tau)$ and are represented by the set Q in Figure 1), while
the $n - q$ remaining balls are black.

We randomly draw $\alpha = \lceil Cn \rceil$ balls from the urn (according to a uniform distribu-
tion), and paint them red. These $\alpha$ balls represent the initial nodes that are replaced by
new nodes after $\delta$ units of time (each of these balls was initially green or black). After
it has been colored red, each of these balls is put back in the urn (so, the urn contains
again n balls).

We then obtain the system as described in the right part of Figure 1 (which repre-
sents the system state at time $\tau' = \tau + \delta$). The set A is the set of balls that have been
painted red. Q' is the core set Q after some of its balls have been painted red (these balls
represent the nodes of the core that have left the system). This means the set $Q' \setminus A$,
that we denote by $E$, contains all the green balls and only them.

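We denote by $\beta$ the number of balls in the set $Q' \cap A$. It is well-known that $\beta$ has
a hypergeometric distribution, i.e., for $a \le k \le b$ where $a = \max(0, \alpha - n + q)$ and
$b = \min(\alpha, q)$, we have

$$\Pr[\beta = k] = \frac{\binom{q}{k} \binom{n-q}{\alpha-k}}{\binom{n}{\alpha}}. \qquad (1)$$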



We finally draw randomly and successively q balls $x_1, \ldots, x_q$ from the urn (system
at time $\tau'$) without replacing them. The problem consists in computing the probability
of the event {none of the selected balls $x_1, \ldots, x_q$ are green}, which can be written as
$\Pr[x_1 \notin E, \ldots, x_q \notin E]$.




Fig. 1. The system at times $\tau$ (left) and $\tau' = \tau + \delta$ (right)

As $\{x \in E\} \Leftrightarrow \{x \in Q'\} \cap \{x \notin Q' \cap A\}$, we have (taking the contrapositive)
$\{x \notin E\} \Leftrightarrow \{x \notin Q'\} \cup \{x \in Q' \cap A\}$, from which we can conclude
$\Pr[x \notin E] = \Pr[\{x \notin Q'\} \cup \{x \in Q' \cap A\}]$. As the events $\{x \notin Q'\}$ and $\{x \in Q' \cap A\}$ are
disjoint, we obtain $\Pr[x \notin E] = \Pr[x \notin Q'] + \Pr[x \in Q' \cap A]$. The system contains
n balls. The number of balls in Q', A and $Q' \cap A$ is equal to q, $\alpha$ and $\beta$, respectively.
Since there is no replacement, we get

$$\Pr[x_1 \notin E, \ldots, x_q \notin E \mid \beta = k] = \prod_{i=1}^{q} \left(1 - \frac{q-k}{n-i+1}\right) = \frac{\binom{n-q+k}{q}}{\binom{n}{q}}. \qquad (2)$$



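To uncondition the aforementioned result (2), we simply multiply it by (1) and sum over
all possible values of k, leading to

$$\Pr[x_1 \notin E, \ldots, x_q \notin E] = \sum_{k=a}^{b} \frac{\binom{n+k-q}{q} \binom{q}{k} \binom{n-q}{\alpha-k}}{\binom{n}{q} \binom{n}{\alpha}}.$$

□ Theorem 1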
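
The closed-form sum of Theorem 1 can be evaluated exactly with integer binomials. The following is a minimal sketch, not the authors' implementation; the function name is hypothetical, and the caller is assumed to supply the number of replaced nodes (the ceiling of Cn) directly as `alpha` so that no floating-point rounding enters the computation.

```python
from fractions import Fraction
from math import comb


def non_intersection_probability(n: int, q: int, alpha: int) -> Fraction:
    """Probability that q probed nodes all miss a core of q nodes,
    when alpha of the n initial nodes have been replaced."""
    a = max(0, alpha - n + q)
    b = min(alpha, q)
    numerator = sum(
        comb(n + k - q, q)        # probes avoid the q - k surviving core nodes
        * comb(q, k)              # k core nodes among the replaced ones
        * comb(n - q, alpha - k)  # remaining replaced nodes are non-core
        for k in range(a, b + 1)
    )
    return Fraction(numerator, comb(n, q) * comb(n, alpha))


if __name__ == "__main__":
    # n = 5, q = 2, one replaced node: 0.4 * 6/10 + 0.6 * 3/10 = 21/50.
    print(non_intersection_probability(5, 2, 1))   # 21/50
```

With `alpha = 0` the sum reduces to the static case, and with `alpha = n` every core node has left, so the probability of missing the core is 1.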



4 From formulas to parameter tuning 



In the previous section, we have provided a set of formulas that can be leveraged
by distributed applications in many ways. Typically, in a P2P system, the
churn rate is not negotiated but observed^1. Nevertheless, applications deployed on P2P
overlays may need to choose the probabilistic guarantee that a node of the initial core is
probed. Given such a probability, the application may fix either the size of the probing
set of nodes or the frequency at which the core needs to be re-established from the
current set of nodes (with the help of an appropriate data transfer protocol).

This section exploits the previous formula to relate these various elements. More 
precisely, we provide the various relations existing between the three factors that can 
be tuned by an application designer: the size of the probing set q, the frequency of the 
probing $\delta$, and the probability of achieving a core characterized by $p = 1 - \epsilon$. (For
the sake of clarity all along this section, a ratio C, or c, is sometimes expressed as a 
percentage. Floating point numbers on the y-axis are represented in their mantissa and 
exponent numbers.) 

Relation linking c and $\delta$. The first parameter that we formalized is C, which can be
interpreted as the rate of dynamism in the system. C depends both on the churn rate (c)
observed in the system and the probing frequency ($1/\delta$). More specifically, we foresee
here a scenario in which an application designer would consider tolerating a churn C in
order to define the size of a core and thus ensure the persistence of some critical data.
For example, an application may need to tolerate a churn rate of 10% in the system,
meaning that the persistence of some critical data should be ensured as long as up to
10% of the nodes in the system change over time. Therefore, depending on the churn
observed and monitored in the system, we are able to define the longest period $\delta$ before
which the core should be re-instantiated on a set of the current nodes. One of the main
interests of linking c and $\delta$ is that if c varies over time, $\delta$ can be adapted accordingly
without compromising the initial requirements of the application.

More formally, Lemma 1 provides an explicit value of C (the ratio of initial nodes
that are replaced) as a function of c (the replacement ratio per time unit) and $\delta$ (the
number of time units). Figure 2 represents this function for several values of C. More
explicitly, it depicts on a logarithmic scale the curve $c = 1 - \sqrt[\delta]{1 - C}$ (or equivalently,
the curve $\delta = \frac{\log(1 - C)}{\log(1 - c)}$). As an example, the curve associated with C = 10% indicates
that 10% of the initial nodes have been replaced after $\delta = 105$ time units (point A,
Figure 2), when the replacement ratio is $c = 10^{-3}$ per time unit. Similarly, the same
replacement ratio per time unit entails the replacement of 30% of the initial nodes when
the duration we consider is $\delta = 356$ time units (point B, Figure 2). The system designer
can benefit from these values to better appreciate the way the system evolves according
to the assumed replacement ratio per time unit. To summarize, this result can be used
as follows. In a system aiming at tolerating a churn of X% of the nodes, our goal is to
provide an application with the corresponding value of $\delta$, knowing the churn c observed
in the system. This gives the opportunity to adjust $\delta$ if c changes over time.
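
Inverting Lemma 1 gives the re-establishment period directly from the observed churn. The sketch below is illustrative; `max_period`, `churn_rate_for` and the parameter name `churn_budget` (the tolerated ratio C) are hypothetical names, and rounding the period down to a whole number of time units is an assumption made here so that the tolerated ratio is never exceeded.

```python
from math import floor, log


def max_period(c: float, churn_budget: float) -> float:
    """Real-valued delta at which a ratio `churn_budget` of the initial nodes
    has been replaced, i.e. the solution of 1 - (1 - c)**delta = churn_budget."""
    return log(1 - churn_budget) / log(1 - c)


def churn_rate_for(churn_budget: float, delta: float) -> float:
    """Per-time-unit ratio c that replaces `churn_budget` of the nodes in delta units."""
    return 1 - (1 - churn_budget) ** (1 / delta)


if __name__ == "__main__":
    print(floor(max_period(1e-3, 0.10)))   # 105 time units for C = 10%
    print(floor(max_period(1e-3, 0.30)))   # 356 time units for C = 30%
```

If the monitored value of c changes, calling `max_period` again with the new value yields the adjusted period for the same tolerated ratio.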

^1 Monitoring the churn rate of a system, although very interesting, is out of the scope of this
paper.




Fig. 2. Evolution of the pair $(c, \delta)$ for given values of C

Relation linking the core size q and $\epsilon$. Now, given a value C set by an application devel-
oper, there are still two parameters that may influence either the overhead of maintaining
a core in the system, or the probabilistic guarantee of having such a core. The overhead 
may be measured in a straightforward manner in this context as the number of nodes 
that need to be probed, namely q. Intuitively, for a given C, as q increases, the probabil- 
ity of probing a node of the initial core increases. In this section, we define how much 
these parameters are related. 

Let us consider the value $\epsilon$ determined by Theorem 1. That value can be interpreted
the following way: $p = 1 - \epsilon$ is the probability that, at time $\tau' = \tau + \delta$, at least one of the q
queries issued (randomly) by a node hits a node of the core. An important question is
then the following: How are $\epsilon$ and q related? Or equivalently, how does increasing
q decrease $\epsilon$? This relation is depicted in Figure 3(a), where several curves are
represented for n = 10,000 nodes.

Each curve corresponds to a percentage of the initial nodes that have been replaced.
(As an example, the curve 30% corresponds to the case where C = 30% of the initial
nodes have left the system; the way C, $\delta$ and c are related has been seen previously.)
Let us consider $\epsilon = 10^{-3}$. The curves show that q = 274 is a sufficient core size
for not exceeding that value of $\epsilon$ when up to 10% of the nodes are replaced (point A,
Figure 3(a)). In contrast, q = 274 is not sufficient when up to 50% of the nodes are
replaced; in that case, the size q = 369 is required (point B, Figure 3(a)).

The curves of both Figure 2 and Figure 3(a) provide the system designer with realis-
tic hints to set the value of $\delta$ (deadline before which a data transfer protocol establishing
a new core has to be executed). Figure 3(b) is a zoom of Figure 3(a) focusing on the
small values of $\epsilon$. It shows that, when $10^{-3} \le \epsilon \le 10^{-2}$, the probability $p = 1 - \epsilon$ in-
creases very rapidly towards 1, though the size of the core increases only very slightly.
As an example, let us consider the curve associated with C = 10% in Figure 3(b). It
shows that a core of q = 224 nodes ensures an intersection probability $p = 1 - \epsilon = 0.99$,
and a core of q = 274 nodes ensures an intersection probability $p = 1 - \epsilon = 0.999$.
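
The smallest core size meeting a target value of epsilon can be found by scanning q and evaluating the sum of Theorem 1. The sketch below is one possible way to do this, not the procedure used to produce the figures: it evaluates the binomials in log space with `math.lgamma` (a numerical shortcut chosen here for speed), and the names `log_comb`, `epsilon` and `min_core_size` are hypothetical.

```python
from math import exp, lgamma


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient, or -inf when it is zero."""
    if k < 0 or k > n:
        return float("-inf")
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def epsilon(n: int, q: int, alpha: int) -> float:
    """Floating-point evaluation of the non-intersection probability of Theorem 1."""
    a = max(0, alpha - n + q)
    b = min(alpha, q)
    log_den = log_comb(n, q) + log_comb(n, alpha)
    return sum(
        exp(log_comb(n + k - q, q) + log_comb(q, k)
            + log_comb(n - q, alpha - k) - log_den)
        for k in range(a, b + 1)
    )


def min_core_size(n: int, alpha: int, target_eps: float) -> int:
    """Smallest q such that epsilon(n, q, alpha) <= target_eps (linear scan)."""
    for q in range(1, n + 1):
        if epsilon(n, q, alpha) <= target_eps:
            return q
    raise ValueError("no core size reaches the target probability")


if __name__ == "__main__":
    # n = 10,000 nodes, C = 10% (alpha = 1,000 replaced nodes), epsilon = 10^-3.
    print(min_core_size(10_000, 1_000, 1e-3))
```

A linear scan keeps the sketch simple; for much larger systems a bisection over q would be a natural refinement, under the assumption that epsilon decreases as q grows.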




Fig. 3. Non-intersection probability over the core size 

Interestingly, this phenomenon is similar to the birthday paradox^2 [8] that can be
roughly summarized as follows. How many persons must be present in a room for two of
them to have the same birthday with probability $p = 1 - \epsilon$? Actually, for that probability
to be greater than 1/2, it is sufficient that the number of persons in the room be equal
(only) to 23! When there are 50 persons in the room, the probability becomes 97%, and
increases to 99.99997% for 100 persons. In our case, we observe a similar phenomenon:
the probability $p = 1 - \epsilon$ increases very rapidly although the core size q increases
only slightly.
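
The birthday figures quoted above can be reproduced with a few lines. This is only an illustrative sketch assuming 365 equally likely birthdays; the function name is hypothetical.

```python
def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` persons share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct


if __name__ == "__main__":
    for people in (23, 50, 100):
        print(people, birthday_collision_probability(people))
```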

In our case, this means that the system designer can choose to slightly increase the 
size of the probing set q (and therefore only slightly increase the associated overhead) 
while significantly increasing the probability to access a node of the core. 



Relation linking q and $\delta$. So far, we have considered that an application may need to
fix C and then define the size of the probing set to achieve a given probability p of
success. There is another remaining trade-off that an application designer might want
to decide upon: trading the size of the probing set with the probing frequency while
fixing the probability $p = 1 - \epsilon$ of intersecting the initial core. This is precisely defined
by relating q to $\delta$ for a fixed $\epsilon$.

In the following we investigate the way the size and lifetime of the core are related
when the required intersection probability is 99% or 99.9%. We chose these values to
better illustrate our purpose, as we believe they reflect what could be expected by an
application designer. For both probabilities we present two different figures summarizing
the required values of q.

Figure 4 focuses on the core size that is required in a static system and in a dynamic
system (according to various values of the ratio C). The static system implies that no
nodes leave or join the system while the dynamic system contains nodes that join and
leave the system depending on several churn values. For the sake of clarity we omit
values of $\delta$ and simply present C taking several values from 10% to 80%. The analysis
of the results depicted in the figure leads to two interesting observations.

^2 The paradox is with respect to intuition, not with respect to logic.

Fig. 4. The core size depending on the system size and the churn rate.

First, when $\delta$ is big enough for 10% of the system nodes to be replaced, then the
core size required is amazingly close to the static case (873 versus 828 when $n = 10^5$
and the probability is 0.999). Moreover, q has to be equal to 990 only when C increases
up to 30%. Second, even when $\delta$ is sufficiently large to let 80% of the system nodes
be replaced, the minimal number of nodes to probe remains low with respect to the
system size. For instance, if $\delta$ is sufficiently large to let 6,000 nodes be replaced in a
system with 10,000 nodes, then only 413 nodes must be randomly probed to obtain an
intersection with probability p = 0.999.
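
The static baseline used in this comparison corresponds to Theorem 1 with no replaced node, in which case the sum reduces to a single ratio of binomials that can be computed as a product. The sketch below is an illustration with hypothetical function names; it assumes, as in the rest of the paper, that the core and the probing set both contain q nodes.

```python
def static_non_intersection(n: int, q: int) -> float:
    """Probability that q random probes miss a core of q nodes when no node leaves."""
    if 2 * q > n:
        return 0.0           # two q-subsets of n nodes necessarily overlap
    p = 1.0
    for i in range(q):
        p *= (n - q - i) / (n - i)   # i-th probe avoids the q core nodes
    return p


def min_static_core_size(n: int, target_eps: float) -> int:
    """Smallest q whose non-intersection probability is at most target_eps."""
    for q in range(1, n + 1):
        if static_non_intersection(n, q) <= target_eps:
            return q
    raise ValueError("no core size reaches the target probability")


if __name__ == "__main__":
    print(min_static_core_size(10**5, 1e-3))   # 828
```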

To conclude, these results clearly show that critical data in a highly dynamic sys-
tem can persist in a scalable way: the delay between core re-establishments can be
reasonably large while the size of the core remains relatively low.

5 Conclusion 

Maintenance of critical data in large-scale dynamic systems where nodes may join and 
leave dynamically is a critical issue. In this paper, we define the notion of persistent 
core of nodes that can maintain such critical data with a high probability regardless of 
the structure of the underlying P2P network. More specifically, we relate the parameters 
that can be tuned to achieve a high probability of defining a core, namely the size of the
core, the frequency at which it has to be re-established, and the churn rate of the system.

Our results provide application designers with a set of guidelines to tune the system 
parameters depending on the expected guarantees and the churn rate variation. An in-
teresting outcome of this paper is to show that slightly increasing the size of the core
results in a significant increase of the probability of the guarantee.

This work opens up a number of very interesting research directions. An interesting 
question is related to the design and evaluation of efficient probing protocols, defining 
such a core in the system applicable to a large spectrum of peer to peer overlay net- 
works. Monitoring the system in order to estimate the churn rate is another interesting 
issue. 